Abstract

Objections to Darwinian evolution are often based on the time required to carry out 
the necessary mutations. Seemingly, exponential numbers of mutations are needed. We 
show that such estimates ignore the effects of natural selection, and that the numbers 
of necessary mutations are thereby reduced to about $K \log L$, rather than $K^L$, where L
is the length of the genomic "word," and K is the number of possible "letters" that can 
occupy any position in the word. The required theory makes contact with the theory 
of radix-exchange sorting in theoretical computer science, and the asymptotic analysis 
of certain sums that occur there. 



1 Introduction 

The 2009 "Year of Darwin" has seen many welcome tributes to this great scientist, and
reaffirmations of the validity of his theory of evolution by natural selection, though this validity
is not universally accepted. One of the main objections that have been raised holds that
there has not been enough time for all of the species complexity that we see to have evolved
by random mutations. Our purpose here is to analyze this process, and our conclusion is
that when one takes account of the role of natural selection in a reasonable way, there has
been ample time for the evolution that we observe to have taken place.

2 The calculations 

Biological evolution is such a complex process that any attempt to describe it precisely 
in a way similar to the description of the dynamic processes in physics by mathematical 
methods is impossible. This does not mean, however, that arbitrary models of biological 
evolution are allowed. Any allowable model has to reflect the main features of evolution. 
Our main aims, discussed below, are to indicate why an evolutionary model often used to 
"discredit" Darwin, leading to the "not enough time" claim, is inappropriate, and to find 
the mathematical properties of a more appropriate model. 

Before doing this we take up some other points. Evolution as a Darwinian-Mendelian 
process takes place via a succession of gene replacement processes, whereby a new "superior" 
gene arises by mutation in the population and, by natural selection, steadily replaces the 
current gene. (We use here the word "gene" rather than the more technically accurate
"allele".) It has recently been estimated [2] that a newborn human carries some 100-200
de novo base mutations. Only about five of these can be expected, on average, to arise in
parts of the genome coding for genes or in regulatory regions. In a population admitting a 
million births in any year, we may expect something on the order of five million such de novo 
mutations, or about 250 per gene in a genome containing 20,000 genes. There is then little 
problem about a supply of new mutations in any gene. However only a small proportion of 
these can be expected to be favorable. We formalize this in the calculations below. 

We now turn to the inappropriate evolutionary model referred to above concerning the 
fixation of these genes in the population. The incorrect argument runs along the following 
lines. Consider the replacement processes needed in order to change each of the resident
genes at L loci in a more primitive genome into a more favorable, or advanced, gene type.
Suppose that at each such gene locus, the argument runs, the proportion of gene
types (alleles) at that gene locus that are more favored than the primitive type is $K^{-1}$. The
probability that at all L loci a more favored gene type is obtained in one round of evolutionary
"trials" is $K^{-L}$, a vanishingly small amount. When trials are carried out sequentially over
time, an exponentially large number of trials (of order $K^L$) would be needed in order to carry
out the complete transformation, and from this some have concluded that the
evolution-by-mutation paradigm doesn't work because of lack of time.

But this argument in effect assumes an "in series" rather than a more correct "in parallel"
evolutionary process. If a superior gene for (say) eye function has become fixed in
a population, it is not thrown out when a superior gene for (say) liver function becomes
fixed. Evolution is an "in parallel" process, with beneficial mutations at one gene locus
being retained after they become fixed in a population while beneficial mutations at other
loci become fixed. In fact this statement is essentially the principle of natural selection.

The paradigm used in the incorrect argument is often formalized as follows. Suppose that 
we are trying to find a specific unknown word of L letters, each of the letters having been 
chosen from an alphabet of K letters. We want to find the word by means of a sequence of 
rounds of guessing letters. A single round consists in guessing all of the letters of the word 
by choosing, for each letter, a randomly chosen letter from the alphabet. If the correct word 
is not found, a new sequence is guessed, and the procedure is continued until the correct 
sequence is found. Under this paradigm the mean number of rounds of guessing until the
correct sequence is found is indeed $K^L$.

But a more appropriate model is the following. After guessing each of the letters, we 
are told which (if any) of the guessed letters are correct, and then those letters are retained. 
The second round of guessing is applied only for the incorrect letters that remain after this 
first round, and so forth. This procedure mimics the "in parallel" evolutionary process. The 
question concerns the statistics of the number of rounds needed to guess all of the letters of 
the word successfully. Our main result is 

Theorem 1 The mean number of rounds that are necessary to guess all of the letters of an
L letter word, the letters coming from an alphabet of K letters, is

$$\frac{\log L}{\log\left(\frac{K}{K-1}\right)} + \beta(L) + O(L^{-1}) \qquad (L \to \infty), \tag{1}$$

with $\beta(L)$ being the periodic function of log L that is given by eq. (7) below. The function
$\beta(L)$ oscillates within a range which, for $K \ge 2$, is never larger than .000002 about the first
two terms on the right-hand side of equation (7).

For example, if we are using a K = 40 letter alphabet, and a word of length 20,000
letters, then the number of possible words is about $10^{34,040}$, but our theorem shows that a
mean of only about

$$\frac{\log 20{,}000}{\log\left(\frac{40}{39}\right)} \approx 391$$

rounds of guessing will be needed, where each round consists of one pass through the entire
as-yet-unguessed word.
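
The "in parallel" guessing game described above can be simulated directly. The following is a minimal sketch, not part of the original analysis; the function names `rounds_to_guess` and `simulated_mean` are illustrative, and the correct letter at each position is arbitrarily labeled 0 so that a single guess succeeds with probability 1/K.

```python
import random


def rounds_to_guess(L, K, rng):
    """Simulate the 'in parallel' game once; return the number of rounds used."""
    remaining = L  # letters not yet guessed correctly
    rounds = 0
    while remaining > 0:
        rounds += 1
        # Each still-wrong letter is guessed uniformly from K letters.
        # Labeling the correct letter 0, a guess is wrong when it is nonzero.
        remaining = sum(1 for _ in range(remaining) if rng.randrange(K) != 0)
    return rounds


def simulated_mean(L, K, trials, seed=0):
    """Average rounds_to_guess over independent trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(rounds_to_guess(L, K, rng) for _ in range(trials)) / trials
```

For modest values such as L = 1000 and K = 5, the simulated mean can be compared with the leading term log L / log(K/(K-1)) of the theorem; the two should differ by roughly the bounded term $\beta(L)$.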

The central feature of this result lies in the logarithmic terms in the above expression.
Even if L is very large, log L is manageable for the values of L that arise in practice in
any genome. The inappropriate arguments referred to above lead to the value
$10^{34,040}$, and arise because of the incorrect "in series" rather than the correct "in parallel"
implicit assumption about the nature of genetic evolution.

We have chosen the numerical values in this example to reflect the biological evolution- 
ary process. The value 20,000 represents the number of gene loci in the genome at which 
replacement processes are to take place. The value K = 40 is arrived at by using the value 
250 found above for the number of de novo mutations per gene locus per year and a rough 
estimate that only one mutation in 10,000 is selectively favored over the resident gene type. 
In practice further modifications are needed to the calculations since, because of stochastic 
events, only a proportion of selectively favored new mutations become fixed in a population. 
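
The arithmetic behind the values 250 and K = 40 can be written out explicitly. This is only a restatement of the figures quoted in the text; the variable names are illustrative, and treating one year as one "round" is an assumption made here for the sake of the calculation.

```python
births_per_year = 1_000_000          # births per year in the population (from the text)
functional_mutations_per_birth = 5   # de novo mutations in coding or regulatory regions
genes = 20_000                       # gene loci in the genome
mutations_per_favored = 10_000       # rough estimate: one mutation in 10,000 is favored

# About five million relevant de novo mutations per year in the population.
mutations_per_year = births_per_year * functional_mutations_per_birth

# Spread over 20,000 genes, this is the per-gene supply of new mutations.
mutations_per_gene = mutations_per_year // genes

# One favored mutation per K new mutations arising at a locus in a year.
K = mutations_per_favored // mutations_per_gene
```

With these inputs, `mutations_per_gene` is 250 and `K` is 40, matching the values used in the example.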

However, the force of our result does not depend on the numerical values that one assigns
to K and L. The fact is that with the parallel model, i.e., taking account of natural selection,
the number of rounds of mutations that are needed to change the complete genome to its
desirable form is only about $K \log L$, instead of the hugely exponential $K^L$ which would
result from the serial model.



3 The analysis 



The probability that the first letter of the word will be correctly guessed in at most r rounds
of guessing is

$$1 - \left(1 - \frac{1}{K}\right)^r,$$

so the probability that all L letters of the given word will be guessed correctly in $\le r$ rounds
is

$$\left(1 - \left(1 - \frac{1}{K}\right)^r\right)^L.$$

Thus the mean number of rounds that will be needed to guess all of the letters of the word
is

$$\sum_{r \ge 1} r \left\{ \left(1 - \left(1 - \frac{1}{K}\right)^r\right)^L - \left(1 - \left(1 - \frac{1}{K}\right)^{r-1}\right)^L \right\},$$

which is simply the mean of the maximum of L independently and identically distributed
(iid) geometric random variables.
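
The infinite sum for the mean can be evaluated numerically by truncating it once the remaining probability mass is negligible. The sketch below is an illustration under that assumption; the name `mean_rounds_series` and the tolerance `tol` are not from the original text.

```python
import math


def mean_rounds_series(L, K, tol=1e-15):
    """Sum r * [P(done within r rounds) - P(done within r-1 rounds)] for r = 1, 2, ...

    The loop stops once the probability of still being unfinished drops below tol.
    Assumes L >= 1 and K >= 1.
    """
    x = 1.0 - 1.0 / K      # probability that a single guess of one letter is wrong
    total = 0.0
    prev_cdf = 0.0         # probability that all letters are correct after 0 rounds
    r = 0
    while 1.0 - prev_cdf > tol:
        r += 1
        # (1 - x^r)^L, computed through log1p to stay accurate when x^r is tiny
        cdf = math.exp(L * math.log1p(-(x ** r)))
        total += r * (cdf - prev_cdf)
        prev_cdf = cdf
    return total
```

As a sanity check, for a single letter (L = 1) the sum reduces to the mean of one geometric random variable, which is K.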

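This infinite sum can be transformed into a finite sum because, writing $x = 1 - 1/K$,

$$\begin{aligned}
\sum_{r \ge 1} r \left\{ (1 - x^r)^L - (1 - x^{r-1})^L \right\}
&= \sum_{r \ge 1} r \sum_{j=1}^{L} \binom{L}{j} (-1)^j \left( x^{rj} - x^{(r-1)j} \right) \\
&= \sum_{j=1}^{L} \binom{L}{j} \frac{(-1)^{j+1}}{1 - x^j}.
\end{aligned}$$

Consequently the mean number of rounds needed to guess all L letters is

$$\alpha(L) = \sum_{j=1}^{L} \binom{L}{j} \frac{(-1)^{j+1}}{1 - \left(1 - \frac{1}{K}\right)^j} = \sum_{j=1}^{L} \binom{L}{j} (-1)^{j+1} \frac{X^j}{X^j - 1} \qquad (X = K/(K-1)). \tag{2}$$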

The appearance of this latter sum in the current context is somewhat surprising. It is one 
which is well known to theoretical computer scientists, and it arises there in connection with 
radix-exchange sorting. 
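
For small L the finite alternating sum can be evaluated exactly. The sketch below uses rational arithmetic because the binomial terms alternate in sign and grow large, so floating-point evaluation would suffer from cancellation; the function name `alpha_exact` is illustrative.

```python
from fractions import Fraction
from math import comb


def alpha_exact(L, K):
    """Evaluate sum_{j=1}^{L} C(L, j) * (-1)^(j+1) / (1 - x^j) exactly, with x = 1 - 1/K."""
    x = Fraction(K - 1, K)
    total = Fraction(0)
    for j in range(1, L + 1):
        sign = 1 if j % 2 == 1 else -1   # (-1)^(j+1)
        total += sign * comb(L, j) / (1 - x ** j)
    return total
```

For example, `alpha_exact(2, 2)` is the mean of the maximum of two geometric random variables with success probability 1/2, which by inclusion-exclusion is 2 + 2 - 4/3 = 8/3.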

To find the asymptotic behavior of this sum, we note that the behavior of the following
sum is known, and can be found in Exercise 50 of section 5.2.2 of [1]:

$$U_{m,n} = \sum_{k \ge 2} \binom{n}{k} \frac{(-1)^k}{m^{k-1} - 1} \qquad (m > 1). \tag{3}$$

The result in [1], due to N. G. de Bruijn, is that

$$U_{m,n} = n \log_m n + n\left(\frac{\gamma - 1}{\log m} - \frac{1}{2} + f_{-1}(n)\right) + \frac{m}{m-1} - \frac{1}{2\log m} - \frac{1}{2} f_1(n) + O(n^{-1}), \tag{4}$$

where $\gamma = .57721..$ is Euler's constant and

$$f_s(n) = \frac{2}{\log m} \sum_{k \ge 1} \Re\left( \Gamma(s - 2\pi i k/\log m) \exp\left(2\pi i k \log_m n\right) \right).$$

These $f_s(n)$'s are bounded functions, and in fact, they are evidently periodic of period 1 in
$\log_m n$.

To relate our sum to Knuth's, we have

$$\alpha(L) = 1 + U_{X,L+1} - U_{X,L} \qquad (X = K/(K-1)).$$
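
This relation can be checked directly from (2) and (3). A short derivation, using only Pascal's rule for binomial coefficients and the fact that $\sum_{j=1}^{L} \binom{L}{j} (-1)^{j+1} = 1$, might run as follows:

```latex
\begin{align*}
U_{X,L+1} - U_{X,L}
  &= \sum_{k \ge 2} \left[ \binom{L+1}{k} - \binom{L}{k} \right] \frac{(-1)^k}{X^{k-1} - 1}
   = \sum_{k \ge 2} \binom{L}{k-1} \frac{(-1)^k}{X^{k-1} - 1} \\
  &= \sum_{j=1}^{L} \binom{L}{j} \frac{(-1)^{j+1}}{X^{j} - 1}, \\
\alpha(L)
  &= \sum_{j=1}^{L} \binom{L}{j} (-1)^{j+1} \left( 1 + \frac{1}{X^{j} - 1} \right)
   = 1 + U_{X,L+1} - U_{X,L}.
\end{align*}
```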

We note in passing that this shows that the mean of the maximum of L iid geometric random
variables is, quite generally, simply related to the quantities $U_{m,n}$, which had previously been
encountered in connection with radix-exchange sorting, and whose (notoriously difficult) asymptotic
behavior had been found as a result of that connection.
After doing the subtraction we obtain 

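$$\begin{aligned}
\alpha(L) &= \frac{\log L}{\log X} + L\left(f_{-1}(L+1) - f_{-1}(L)\right) + \frac{1}{2} + \frac{\gamma}{\log X} - \frac{1}{2}\left(f_1(L+1) - f_1(L)\right) + O(L^{-1}) \\
&= \frac{\log L}{\log X} + \beta(L) + O(L^{-1}),
\end{aligned} \tag{5}$$

where again $X = K/(K-1)$, and we have written

$$\beta(L) = L\left(f_{-1}(L+1) - f_{-1}(L)\right) + \frac{1}{2} + \frac{\gamma}{\log X} - \frac{1}{2}\left(f_1(L+1) - f_1(L)\right). \tag{6}$$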

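But we have

$$L\left(f_{-1}(L+1) - f_{-1}(L)\right) = \frac{2}{\log m} \sum_{k \ge 1} \Re\left( \Gamma(-1 - 2\pi i k/\log m) \exp\left(2\pi i k \log_m L\right) \frac{2\pi i k}{\log m} \right) + O(L^{-1}),$$

whereas

$$f_1(L+1) - f_1(L) = O(L^{-1}).$$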
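Therefore for $\beta(L)$ in (5) we can take, with $m = X$,

$$\beta(L) = \frac{1}{2} + \frac{\gamma}{\log X} + \frac{2}{\log X} \sum_{k \ge 1} \Re\left( \frac{2\pi i k}{\log X}\, \Gamma(-1 - 2\pi i k/\log X) \exp\left(2\pi i k \log_X L\right) \right). \tag{7}$$
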
Therefore the mean number of rounds that are necessary to guess all of the letters of an
L letter word, the letters coming from an alphabet of K letters, is given exactly by (2), and
is asymptotically

$$\frac{\log L}{\log\left(\frac{K}{K-1}\right)} + \beta(L) + O(L^{-1}) \qquad (L \to \infty), \tag{8}$$

with $\beta(L)$ being the periodic function of period 1 in $\log_X L$ that is given by (7) and
$X = K/(K-1)$. If terms of order $L^{-1}$ and the small constant 0.000002 mentioned in the statement
of the theorem are ignored, this is

$$\frac{\log L}{\log\left(\frac{K}{K-1}\right)} + \frac{1}{2} + \frac{\gamma}{\log\left(\frac{K}{K-1}\right)}. \tag{9}$$

We conclude with a comment on the oscillatory behavior of this mean, as revealed by
the exact expression in (7). In probability theory the asymptotic behavior of a maximum of
several iid random variables is often found by "sandwiching" the discrete random variable
between two continuous random variables whose asymptotic behavior is known. In the case of
geometric random variables the appropriate continuous random variables to be used for this
purpose have negative exponential distributions. This sandwiching procedure has been used
frequently and leads to the expression (9), but does not lead to the more precise oscillatory
behavior exhibited in the expression (7).
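
The size of the oscillation can be estimated numerically. The sketch below sums the magnitudes of the first few oscillating terms, each of the form (2/log X) (2πik/log X) Γ(-1 - 2πik/log X), as they appear in the expression for $L(f_{-1}(L+1) - f_{-1}(L))$ above. It assumes the standard identities $|\Gamma(iy)|^2 = \pi/(y \sinh \pi y)$ and $\Gamma(z+1) = z\Gamma(z)$. The function name `oscillation_amplitude` and the cutoff `terms` are illustrative choices, not taken from the original analysis, and the result is a rough check rather than a rigorous bound.

```python
import math


def oscillation_amplitude(K, terms=5):
    """Sum over k = 1..terms of |(2/log X) * (2 pi i k/log X) * Gamma(-1 - 2 pi i k/log X)|.

    Here X = K/(K - 1), so K >= 2 is required.
    """
    log_X = math.log(K / (K - 1))
    total = 0.0
    for k in range(1, terms + 1):
        y = 2.0 * math.pi * k / log_X
        # log sinh(t) = t - log 2 + log(1 - exp(-2t)), which avoids overflow for large t
        log_sinh = math.pi * y - math.log(2.0) + math.log1p(-math.exp(-2.0 * math.pi * y))
        # |Gamma(iy)|^2 = pi / (y sinh(pi y))
        log_abs_gamma_iy = 0.5 * (math.log(math.pi) - math.log(y) - log_sinh)
        # |Gamma(-1 - iy)| = |Gamma(-iy)| / |1 + iy|; the extra factor |iy| = y multiplies it
        log_term = (math.log(2.0 / log_X) + math.log(y)
                    - 0.5 * math.log1p(y * y) + log_abs_gamma_iy)
        total += math.exp(log_term)
    return total
```

Under these assumptions the amplitude is largest for K = 2, where it stays below the figure .000002 quoted in the theorem, and it shrinks very rapidly as K grows.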